iOS MachineLearning 系列(11)—— 自然语言识别与单词分析
在上一篇文章中,我们介绍了使用NaturalLanguage框架来进行自然语言的拆解,可以将一段文本按照单词,句子或段落的模式进行拆解。并且,在进行拆解时,其可以自动的识别所使用的语言。
其实,NaturalLanguage框架本身也提供了语言识别的能力,其可以分析一段文本所对应的语言,同样对于包含多种语言的文本,其可以分析出各种语言的占比。语言识别是其他高级自然语言处理任务的基础,本篇文章还将介绍NaturalLanguage关于文本分析的能力,其能够对文本中的人名,地名和组织名进行识别,也可以对词性进行分析,如动词,名词。甚至我们还可以分析文本的积极或消极程度来推测内容的取向,从而帮助开发者开发出更加智能的应用。
1 - 语言识别
NLLanguageRecognizer类用来进行语言识别,其可以对输入的文本所使用的语言进行推断,使用非常简单。
首先初始化一个NLLanguageRecognizer实例,如下:
1
| let recognizer = NLLanguageRecognizer()
|
可以定义一些示例的字符串来测试识别能力,如:
1 2 3
| let string1 = "世界,你好!" let string2 = "Hello World!" let string3 = "こんにちは中国"
|
调用NLLanguageRecognizer实例的processString方法即可对字符串进行解析,这个方法是同步的,解析完成后,通过dominantLanguage属性即可获取到这段文本所使用的最接近的语言,例如上面的示例字符串中,string1和string2是比较单纯的中文和英文,string3是日语,日语中很多字是和中文一样的,因此对其进行识别可能会出现误差,我们也可以使用languageHypotheses方法来获取可能识别出的语言,返回的结果中会对识别出的每种语言的可信度进行标记。上面的字符串识别效果如下:

其中,zh-Hant为汉语,en为英语,ja为日语。
NLLanguageRecognizer类的使用很简单,其中封装属性和方法列举如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
| open class NLLanguageRecognizer : NSObject { open class func dominantLanguage(for string: String) -> NLLanguage? // 对一个字符串进行识别任务 open func processString(_ string: String) // 重置状态 open func reset() // 最近一次识别任务的结果 open var dominantLanguage: NLLanguage? { get } open var languageConstraints: [NLLanguage] public func languageHypotheses(withMaximum maxHypotheses: Int) -> [NLLanguage : Double] }
|
NLLanguag是描述语言的结构体,支持的语言列举如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
| extension NLLanguage { public static let undetermined: NLLanguage public static let amharic: NLLanguage public static let arabic: NLLanguage public static let armenian: NLLanguage public static let bengali: NLLanguage public static let bulgarian: NLLanguage public static let burmese: NLLanguage public static let catalan: NLLanguage public static let cherokee: NLLanguage public static let croatian: NLLanguage public static let czech: NLLanguage public static let danish: NLLanguage public static let dutch: NLLanguage public static let english: NLLanguage public static let finnish: NLLanguage public static let french: NLLanguage public static let georgian: NLLanguage public static let german: NLLanguage public static let greek: NLLanguage public static let gujarati: NLLanguage public static let hebrew: NLLanguage public static let hindi: NLLanguage public static let hungarian: NLLanguage public static let icelandic: NLLanguage public static let indonesian: NLLanguage public static let italian: NLLanguage public static let japanese: NLLanguage public static let kannada: NLLanguage public static let khmer: NLLanguage public static let korean: NLLanguage public static let lao: NLLanguage public static let malay: NLLanguage public static let malayalam: NLLanguage public static let marathi: NLLanguage public static let mongolian: NLLanguage public static let norwegian: NLLanguage public static let oriya: NLLanguage public static let persian: NLLanguage public static let polish: NLLanguage public static let portuguese: NLLanguage public static let punjabi: NLLanguage public static let romanian: NLLanguage public static let russian: NLLanguage public static let simplifiedChinese: NLLanguage public static let sinhalese: NLLanguage public static let slovak: NLLanguage public static let spanish: NLLanguage public static let swedish: NLLanguage public static let tamil: NLLanguage public static let telugu: NLLanguage public static let thai: NLLanguage public static let tibetan: NLLanguage public static let traditionalChinese: NLLanguage public static let turkish: NLLanguage public static let ukrainian: NLLanguage public static let urdu: NLLanguage public static let vietnamese: NLLanguage public static let kazakh: NLLanguage }
|
2 - 文本分析
文本分析支持对单词进行分析,也支持对句子和段落进行分析。针对不同的需求场景,可以使用不同的方案来分析。在NaturalLanguage框架中,使用NLTagScheme结构体来定义分析方案,支持的方案列举如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
| extension NLTagScheme { public static let tokenType: NLTagScheme public static let lexicalClass: NLTagScheme public static let nameType: NLTagScheme public static let nameTypeOrLexicalClass: NLTagScheme public static let lemma: NLTagScheme public static let language: NLTagScheme public static let script: NLTagScheme public static let sentimentScore: NLTagScheme }
|
文本分析的结果会被封装为NLTag结构体,此结构体会包含一个字符串类型的原始值,对于lemma,language,script,sentimentScore分析方案,其结果会直接包装成字符串,其他的分析方案的结果则进行了定义,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
| extension NLTag { public static let word: NLTag public static let punctuation: NLTag public static let whitespace: NLTag public static let other: NLTag
public static let noun: NLTag public static let verb: NLTag public static let adjective: NLTag public static let adverb: NLTag public static let pronoun: NLTag public static let determiner: NLTag public static let particle: NLTag public static let preposition: NLTag public static let number: NLTag public static let conjunction: NLTag public static let interjection: NLTag public static let classifier: NLTag public static let idiom: NLTag public static let otherWord: NLTag public static let sentenceTerminator: NLTag public static let openQuote: NLTag public static let closeQuote: NLTag public static let openParenthesis: NLTag public static let closeParenthesis: NLTag public static let wordJoiner: NLTag public static let dash: NLTag public static let otherPunctuation: NLTag public static let paragraphBreak: NLTag public static let otherWhitespace: NLTag }
|
下面,我们来对每种分析方案进行介绍。
tokenType
tokenType方法非常简单,直接对元素类型进行简单分类,效果如下图所示:

lexicalClass
lexicalClass方法相比tokenType更加高级,能够更加细致的单词进行分类,但是需要注意,lexicalClass方案只对英文支持较好。效果如下:

nameType
此方案用来解析文本中的组织名,地名,人名。同样对英文支持较好,如下:

可以看到,其中国家的名字,人名和城市名都正确的解析了出来。
nameTypeOrLexicalClass
此方案无需做过多的解释,只是两种方法的聚合。
lemma
此方案用来分析词干,主要也是针对英文,效果如下:

language与script
这两个方案都是分析元素的语言相关。
sentimentScore
此方案只能用来进行句子和段落的分析,可以推测出文案内容的积极程度,结果越接近1,标明内容的积极性越高,越接近-1表示越消极。例如:

可以看到其对积极和消极的判定还是比较准确,通过测试,目前也只针对英文有效。
最后,我们再来介绍下用来触发文本分析的NLTagger类,在进行分析前,首先需要实例化此类:
1
| let tagger = NLTagger(tagSchemes: [.lexicalClass, .tokenType, .lemma, .nameType, .script, .nameTypeOrLexicalClass, .sentimentScore, .language])
|
此实例化方法中传入的参数表示要支持的分析方案。使用如下代码来触发分析:
1 2 3 4 5
| tagger.string = string tagger.enumerateTags(in: string.startIndex ..< string.endIndex, unit: .paragraph, scheme: .sentimentScore) { tag, range in resultLabel.text = (resultLabel.text ?? "").appending("【[\(string[range])]-[\(tag?.rawValue ?? "")]】") return true }
|
NLTagger类定义如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
| open class NLTagger : NSObject { public init(tagSchemes: [NLTagScheme]) open var tagSchemes: [NLTagScheme] { get } open var string: String? open class func availableTagSchemes(for unit: NLTokenUnit, language: NLLanguage) -> [NLTagScheme] // 输入文本的主语言 open var dominantLanguage: NLLanguage? { get } open func setModels(_ models: [NLModel], forTagScheme tagScheme: NLTagScheme) open func models(forTagScheme tagScheme: NLTagScheme) -> [NLModel] open func setGazetteers(_ gazetteers: [NLGazetteer], for tagScheme: NLTagScheme) open func gazetteers(for tagScheme: NLTagScheme) -> [NLGazetteer] open class func requestAssets(for language: NLLanguage, tagScheme: NLTagScheme, completionHandler: @escaping (NLTagger.AssetsResult, Error?) -> Void) open class func requestAssets(for language: NLLanguage, tagScheme: NLTagScheme) async throws -> NLTagger.AssetsResult // 获取元素所在字符串范围 public func tokenRange(at index: String.Index, unit: NLTokenUnit) -> Range<String.Index> public func tokenRange(for range: Range<String.Index>, unit: NLTokenUnit) -> Range<String.Index> // 对某个位置的元素进行解析 public func tag(at index: String.Index, unit: NLTokenUnit, scheme: NLTagScheme) -> (NLTag?, Range<String.Index>) // 对某个位置的元素进行解析,返回肯能的结果 public func tagHypotheses(at index: String.Index, unit: NLTokenUnit, scheme: NLTagScheme, maximumCount: Int) -> ([String : Double], Range<String.Index>) // 进行完整解析 public func enumerateTags(in range: Range<String.Index>, unit: NLTokenUnit, scheme: NLTagScheme, options: NLTagger.Options = [], using block: (NLTag?, Range<String.Index>) -> Bool) // 进行范围解析 public func tags(in range: Range<String.Index>, unit: NLTokenUnit, scheme: NLTagScheme, options: NLTagger.Options = []) -> [(NLTag?, Range<String.Index>)] // 手动设置语言 public func setLanguage(_ language: NLLanguage, range: Range<String.Index>) public func setOrthography(_ orthography: NSOrthography, range: Range<String.Index>) }
|
其中availableTagSchemes获取到的可用方案不一定准确,有可能是资源未加载,使用requestAssets可以请求资源,如果最终不能支持,可以从其返回的结果判断:
1 2 3 4 5 6 7 8
| public enum AssetsResult : Int, @unchecked Sendable { case available = 0 case notAvailable = 1 case error = 2 }
|
enumerateTags方法中有一个options参数,此参数可以对分析的过程进行配置,支持的配置项如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
| public struct Options : OptionSet, @unchecked Sendable { public static var omitWords: NLTagger.Options { get } public static var omitPunctuation: NLTagger.Options { get } public static var omitWhitespace: NLTagger.Options { get } public static var omitOther: NLTagger.Options { get } public static var joinNames: NLTagger.Options { get } public static var joinContractions: NLTagger.Options { get } }
|
完整的示例代码可以在如下地址找到:
https://github.com/ZYHshao/MachineLearnDemo